1. Data Processing
Reading the CSV file and storing the data frame in covidtable:
covidtable = read.csv("C:/Users/Creacion Tech/Documents/covid_data.csv")
The line below shows the first 10 columns of the data frame.
str(covidtable[,c(1:10)])
## 'data.frame': 126 obs. of 10 variables:
## $ Sample : chr "C1" "C2" "C3" "C4" ...
## $ Age : int 39 63 33 49 49 40 38 78 64 62 ...
## $ Sex : chr "male" "male" "male" "male" ...
## $ Severity: chr "NonICU" "NonICU" "NonICU" "NonICU" ...
## $ A1BG : num 0.49 0.29 0.26 0.45 0.17 0.21 0.49 0.12 0.51 0.1 ...
## $ A1CF : num 0 0 0 0.01 0 0 0.01 0 0.01 0 ...
## $ A2M : num 0.21 0.14 0.03 0.09 0 0.08 0.23 0.08 0.88 0.13 ...
## $ A2ML1 : num 0.04 0 0.02 0.07 0.05 0.04 0.03 0.01 0.02 0.01 ...
## $ A3GALT2 : num 0.07 0 0 0 0.07 0 0.07 0 0.79 0.15 ...
## $ A4GALT : num 0 0 0 0 0 0 0 0 0 0 ...
The data frame shows that all the gene-sequence columns are numeric, while the remaining variables are categorical.
We are predicting Severity using the genome sequences only, so the other variables are removed from the data set.
covidtable = covidtable[,-c(1:3)]
Visualizing the frequency distribution of the ICU/NonICU target variable.
fig <- plot_ly(x = covidtable$Severity, type = "histogram", color = covidtable$Severity)%>%
layout(title = 'Frequency of each class')
fig
We can understand the distribution of Severity values by looking at the proportion of each class.
table(covidtable$Severity) / nrow(covidtable)
##
## ICU NonICU
## 0.5238095 0.4761905
So, it is clearly a balanced distribution.
Changing the Severity “ICU” to 1 and “NonICU” to 0
covidtable$Severity[covidtable$Severity == "ICU"] = 1
covidtable$Severity[covidtable$Severity == "NonICU"] = 0
Changing the data type of Severity from character to factor.
covidtable$Severity = as.factor(covidtable$Severity)
Normalizing the genome sequences
Normalization is done to change the values of numeric columns in the data set to use a common scale, without distorting differences in the ranges of values or losing information.
# function to normalize data
Normalize_Data <- function(val) { return ((val - min(val)) / (max(val) - min(val))) }
for (col in 2:ncol(covidtable)) {
covidtable[,col] = Normalize_Data(covidtable[,col])
}
Removing any NAs introduced after normalization. (NAs appear wherever a column is constant, since max and min are then equal and the denominator is zero.)
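As a minimal illustration (with hypothetical values, not from the data) of why the normalization step introduces NAs: a constant column has max equal to min, so min-max scaling divides by zero and yields NaN, which is.na() treats as missing.

```r
# A constant column produces 0/0 = NaN under min-max scaling
Normalize_Data <- function(val) { (val - min(val)) / (max(val) - min(val)) }
Normalize_Data(c(0.3, 0.3, 0.3))
# [1] NaN NaN NaN
# is.na() is TRUE for NaN, so colSums(is.na(...)) catches these columns
```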
covidtable = covidtable[ , colSums(is.na(covidtable)) == 0]
Analysing the independent variables
ncol(covidtable[,-c(1)])
## [1] 18318
There are 18318 columns. So we need to select only the ones that have some correlation with the dependent variable.
Feature selection using Boruta
The main data frame is divided into covidtable1 and covidtable2 to avoid overloading the computer's memory in the feature-selection function.
covidtable1 = covidtable[,c(1,2:10000)]
covidtable2 = covidtable[,c(1,10001:ncol(covidtable))]
Boruta() and getSelectedAttributes() are applied to covidtable1 and covidtable2 separately, and the features returned are stored in boruta_signif1 and boruta_signif2 respectively.
boruta_output1 <- Boruta(Severity ~ ., data=na.omit(covidtable1), doTrace=0)
boruta_signif1 <- getSelectedAttributes(boruta_output1, withTentative = TRUE)
boruta_output2 <- Boruta(Severity ~ ., data=na.omit(covidtable2), doTrace=0)
boruta_signif2 <- getSelectedAttributes(boruta_output2, withTentative = TRUE)
The lists boruta_signif1 and boruta_signif2 were then combined:
feature_selected = c(boruta_signif1,boruta_signif2)
The data frame was filtered so that only the features which have some relation with the dependent variable were kept.
covidtable = covidtable[,c("Severity",feature_selected)]
Boruta() and getSelectedAttributes() were applied to the data again; however, this time withTentative was set to FALSE, meaning only the features with a significant correlation with "Severity" were kept.
boruta_output <- Boruta(Severity ~ ., data=na.omit(covidtable), doTrace=0)
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = FALSE)
The data frame was filtered again, and this time only the features with a significant correlation with the dependent variable were kept.
covidtable = covidtable[,c("Severity",boruta_signif)]
The list of the selected features was saved as a txt file.
write.csv(names(covidtable),"featuresSelected.txt", row.names = FALSE)
The list is read from the local directory and loaded into R. However, the model still failed to converge with all 27 shortlisted variables, so only the first 11 variables from the list were kept.
featuresSelected = read.csv("featuresSelected.txt")
featuresSelected = unlist(featuresSelected)
featuresSelected = featuresSelected[1:12]
List of the selected features (note that the first entry is the target variable Severity itself, hence the 1:12 subset):
featuresSelected
## x1 x2 x3 x4 x5 x6 x7
## "Severity" "ARG1" "CCL3" "CCL5" "CDHR2" "CYB561" "CYSLTR1"
## x8 x9 x10 x11 x12
## "EGR2" "FCER1A" "GALNT6" "GPR68" "GRB10"
Filtering the data frame using the selected features list
covidtable = covidtable[,c(featuresSelected)]
Now our data is clean.
2. Training and Tuning
Test Train division
The seed is set so that the random numbers generated are the same on each run. This helps in evaluating the model because the train and test data sets remain the same every time. Then the sample size is set to 80% of the total observations and random indexes are generated. The observations at those indexes are assigned to the train set, and the others are assigned to the test set.
set.seed(222)
sample_size = round(nrow(covidtable)*.80)
index <- sample(seq_len(nrow(covidtable)), size = sample_size)
train <-covidtable[index, ]
test <- covidtable[-index,]
k-fold cross-validation
train.control <- trainControl(method = "cv", number = 5)
Training the model
The logistic regression model (glm) is used for training.
model <- train(Severity ~., data = train, method = "glm",
trControl = train.control)
3. Model Validation
model
## Generalized Linear Model
##
## 101 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 81, 81, 81, 81, 80
## Resampling results:
##
## Accuracy Kappa
## 0.8019048 0.6021277
The model gave 80% accuracy on the train data set, which is good given that we have just 101 observations in the train set. With 5-fold cross-validation, in each fold 80% of those observations are used for fitting the model and 20% for its validation.
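Incidentally, the per-fold results behind this averaged accuracy can be inspected on the caret train object (a sketch, assuming the `model` object fitted above):

```r
# caret stores the held-out performance of each fold in $resample
model$resample                   # one row per fold: Accuracy, Kappa, Resample
mean(model$resample$Accuracy)    # averages back to the accuracy reported above
```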
4. Model Interpretation
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.81671 -0.37507 0.00191 0.32658 2.01942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6764 1.5480 1.083 0.2788
## ARG1 1.9524 4.8114 0.406 0.6849
## CCL3 -5.1250 2.2407 -2.287 0.0222 *
## CCL5 4.0904 5.0386 0.812 0.4169
## CDHR2 8.8415 5.9049 1.497 0.1343
## CYB561 -5.3571 2.8492 -1.880 0.0601 .
## CYSLTR1 -4.1886 2.3155 -1.809 0.0705 .
## EGR2 0.2514 2.4488 0.103 0.9182
## FCER1A 4.1301 2.7033 1.528 0.1266
## GALNT6 0.6902 3.2337 0.213 0.8310
## GPR68 -6.8297 5.9687 -1.144 0.2525
## GRB10 6.1498 4.4722 1.375 0.1691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 139.927 on 100 degrees of freedom
## Residual deviance: 56.569 on 89 degrees of freedom
## AIC: 80.569
##
## Number of Fisher Scoring iterations: 7
Coefficients
Above is the summary of our logistic regression model. The coefficients are on the log-odds scale: a one-unit increase in "ARG1" increases the log-odds of Severity ("ICU" versus "NonICU") by 1.9524, and a one-unit increase in "CDHR2" increases the log-odds by 8.8415.
The other variables can be interpreted in the same fashion.
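Because the coefficients are log-odds, exponentiating them gives odds ratios, which are often easier to read. A minimal sketch using the CCL3 estimate reported above (the full model's ratios could be obtained with `exp(coef(model$finalModel))`, assuming the caret object `model` from above):

```r
# Odds ratio for CCL3: each unit increase in the normalized CCL3 value
# multiplies the odds of ICU severity by this factor
exp(-5.1250)
# [1] 0.005946   # i.e. higher CCL3 strongly decreases the odds of ICU
```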
Null Deviance and Residual Deviance
The difference between the two tells us whether the model is a good fit: the greater the difference, the better the model. The null deviance is the deviance of a model with only an intercept and no variables, while the residual deviance is the deviance when all the variables are taken into account. It makes sense to consider the model good if that difference is big enough. In our case, "Null deviance: 139.927" and "Residual deviance: 56.569" show that our model is a good fit.
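This comparison can be made formal with a likelihood-ratio (deviance) test: under the null hypothesis that no predictor matters, the drop in deviance follows a chi-squared distribution with degrees of freedom equal to the number of predictors. A sketch using the values reported above:

```r
dev_diff <- 139.927 - 56.569   # drop in deviance: 83.358
df_diff  <- 100 - 89           # 11 predictors
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
# a p-value far below 0.05, so the predictors jointly improve the fit
```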
Fisher Scoring Iterations
This is the number of iterations needed to fit the model. Logistic regression uses an iterative maximum-likelihood algorithm to fit the data; the Fisher scoring method is equivalent to fitting the model by iteratively reweighted least squares. Our model reports "Fisher Scoring iterations: 7", which means it was able to converge in 7 iterations.
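For intuition, the iteratively reweighted least squares idea can be sketched in a few lines of R. This is an illustration of the technique, not glm()'s actual implementation: each iteration solves a weighted least-squares problem on a "working response" built from the current fit.

```r
# Sketch of IRLS for logistic regression.
# X: design matrix including an intercept column; y: 0/1 response.
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)          # linear predictor
    mu  <- 1 / (1 + exp(-eta))            # fitted probabilities
    W   <- mu * (1 - mu)                  # working weights
    z   <- eta + (y - mu) / W             # working response
    beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- as.vector(beta_new)
  }
  list(coef = beta, iterations = i)
}
```

On well-behaved data this converges to the same coefficients as glm(..., family = binomial) in a handful of iterations, matching the count reported in the summary.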
5. Predictions
The steps below first predict the Severity using the test data set using the model above. Then those predictions are compared to the actual values and overall accuracy of the prediction is calculated at the end.
test$Predicted = predict(model,test[,c(2:ncol(test))])
Error <- mean(test$Predicted != test$Severity)
print(paste('Accuracy',round((1-Error)*100,2), "%" ) )
## [1] "Accuracy 88 %"
Confusion Matrix
A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading. Calculating a confusion matrix can give a better idea of what classification model is getting right and what types of errors it is making.
test$Predicted = as.factor(test$Predicted)
CM <- confusionMatrix(data=test$Predicted, reference = test$Severity)
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9 1
## 1 2 13
##
## Accuracy : 0.88
## 95% CI : (0.6878, 0.9745)
## No Information Rate : 0.56
## P-Value [Acc > NIR] : 0.0006695
##
## Kappa : 0.7541
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8182
## Specificity : 0.9286
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8667
## Prevalence : 0.4400
## Detection Rate : 0.3600
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.8734
##
## 'Positive' Class : 0
##
The confusion matrix above shows "Pos Pred Value : 0.9000" and "Neg Pred Value : 0.8667". Note that caret treats "0" (NonICU) as the positive class here, so the model's positive predictions were correct 90% of the time and its negative predictions about 87% of the time. Hence our model isn't biased and is predicting both classes of data with high accuracy.
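The reported statistics can be reproduced by hand from the matrix counts above, keeping in mind that "0" is the positive class:

```r
# Counts from the confusion matrix (rows = Prediction, cols = Reference)
TP <- 9; FP <- 1; FN <- 2; TN <- 13
c(Sensitivity = TP / (TP + FN),   # 9/11  ~ 0.8182
  Specificity = TN / (TN + FP),   # 13/14 ~ 0.9286
  PosPredVal  = TP / (TP + FP),   # 9/10  = 0.9000
  NegPredVal  = TN / (TN + FN))   # 13/15 ~ 0.8667
```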
Underfitting/Overfitting
The model gave 80% accuracy on the train data set and 88% on the test data set. This shows the model is neither under-fitted nor over-fitted. Moreover, the performance of the model could be improved if we had a larger number of observations.
6. Conclusion
We predicted the Severity of a patient using his/her genome sequences. The whole process consisted of data cleaning, training the model, and assessing its performance. We started with over 18000 features, but after numerous feature-selection steps we were left with just 11. This step helped the model converge and give reasonable accuracy on both the train and test data sets. The overall performance of the model could be improved with a larger number of observations than the 126 available. In a nutshell, we were able to predict the severity of a patient with 88% accuracy, which is quite a good result.